Abstract

Background: Granzyme B is a serine protease which cleaves at unique tetrapeptide sequences. It is involved in 
several signaling cross-talks with caspases and functions as a pivotal mediator in a broad range of cellular 
processes such as apoptosis and inflammation. The granzyme B degradome constitutes proteins from a myriad of 
functional classes with many more expected to be discovered. However, the experimental discovery and validation 
of bona fide granzyme B substrates require time consuming and laborious efforts. As such, computational methods 
for the prediction of substrates would be immensely helpful. 

Results: We have compiled a dataset of 580 experimentally verified granzyme B cleavage sites and found 
distinctive patterns of residue conservation and position-specific residue propensities which could be useful for in 
silico prediction using machine learning algorithms. We trained a series of support vector machine (SVM) classifiers
employing Bayes Feature Extraction to predict cleavage sites using sequence windows of diverse lengths and
compositions. The SVM classifiers achieved accuracy and A_ROC scores of 71.00% to 86.50% and 0.78 to 0.94
respectively on independent test sets. We have applied our prediction method on the Chikungunya viral proteome 
and identified several regulatory domains of viral proteins to be potential sites of granzyme B cleavage, suggesting 
direct antiviral activity of granzyme B during host-viral innate immune responses. 

Conclusions: We have compiled a comprehensive dataset of granzyme B cleavage sites and developed an 
accurate SVM-based prediction method utilizing Bayes Feature Extraction to identify novel substrates of granzyme 
B in silico. The prediction server is available online, together with reference datasets and supplementary materials. 



Background 

Proteolysis - the specific and limited cleavage of proteins 
by enzymes called proteases - represents an important 
mechanism for post-translational control in all living 
organisms [1]. Granzymes (short for granule enzymes) 
belong to a unique class of serine proteases which are 
known to mediate critical roles in the innate immune 
response against virus-infected or tumor cells through 
the induction of apoptotic cell death [2]. Consequently, 
the enzymes have been implicated in the pathogenesis 
of several chronic inflammatory and cardiovascular dis- 
orders. Granzymes are released into the cytoplasm of 
the target cells through endocytosis of cytolytic granules 
released by cytotoxic T cells or natural killer cells [2].
Once released into the target cells, granzymes go on to 
cleave specific cellular proteins and activate multiple sig- 
naling pathways leading to apoptotic cell death. Of the 
five human subtypes discovered to date (granzymes A, 
B, H, K and M), granzyme B has been the most well 
studied. Like caspases, granzyme B recognizes specific
tetrapeptide sequence motifs (P4-P3-P2-P1) and cleaves
proteins after an aspartate residue at P1 [3,4]. Besides cleaving
specific proteins regulating apoptotic cell death,
granzyme B has been reported to cleave proteins across 
a wide spectrum of other functional classes, ranging 
from nuclear and cytoskeletal components to membrane 
receptors and viral proteins [5]. 

To date, more than 500 granzyme B substrates have 
been characterized and many more are expected to be 
identified [5]. While systematic experimental discovery 
and validation of bona fide substrates are necessary for
elucidating the granzyme B degradome, many of the 
processes are often time consuming and laborious. For 
these reasons, computational prediction of substrates 
could be immensely helpful in generating initial hypoth- 
eses and experimental leads. While a wide range of 
computational methods have been applied for substrate 
prediction of related proteases such as caspases [6,7], 
only a limited number are available for prediction of 
granzyme B substrates. PeptideCutter [8] is a general
protease cleavage prediction server which predicts
potential granzyme B cleavage sites using preferential
tetrapeptide cleavage (P4-P3-P2-P1) specificities
derived from in vitro combinatorial library studies by
Thornberry et al. [4]. Backes et al. developed the GraBCas
software, which extended the use of the in vitro specificities
by incorporating position-specific scoring
matrices and accounting for conserved residues at the P1'
and P2' positions [9]. More recently, Barkan et al.
advanced the field through the application of the sup- 
port vector machines (SVM) method on a set of experi- 
mentally verified cleavage sites using both sequence and 
structural features [10]. 

In this paper, we have compiled a dataset of 580 
experimentally verified granzyme B cleavage sites and 
found distinctive patterns of residue conservation and 
position-specific residue propensities which could be 
useful for in silico prediction using machine learning 
algorithms. We trained a series of SVM classifiers 
employing Bayes Feature Extraction to predict cleavage 
sites using sequence windows of diverse lengths and 
compositions. The SVM classifiers achieved accuracy
and A_ROC scores of 71.00% to 86.50% and 0.78 to
0.94 respectively on independent test sets. We applied
our prediction method on the Chikungunya viral pro- 
teome and identified several regulatory domains of viral 
proteins to be potential sites of granzyme B cleavage, 
suggesting direct antiviral activity of granzyme B during 
host-viral innate immune responses. A web server, 
together with reference datasets and supplementary 
materials, can be accessed at
http://www.casbase.org/grasvm/index.html.

Results and discussion 

Sequence analysis of granzyme B cleavage sites 

Using peptide combinatorial libraries, Thornberry and 
co-workers had previously identified the presence of dis- 
tinctive sequence specificities governing protein cleavage 
of both caspase and granzyme B substrates [4]. In particular,
specific tetrapeptide sequences upstream of the
cleavage site (P4-P3-P2-P1) of granzyme B targets serve
as recognition sites for protein cleavage. The tetrapeptide
"IEPD" was identified as the optimal tetrapeptide
cleavage sequence in vitro. However, emerging data on
granzyme B substrates suggest that the in vivo cleavage
specificities are far more diverse, with numerous substrates
possessing cleavage specificities extending beyond
the tetrapeptide sequence [5,10].

We compiled a comprehensive dataset of 580 unique 
granzyme B cleavage sites extracted from experimentally 
verified substrates as reported in literature. Data was 
extracted from the substrates list compiled in Barkan et 
al. [10], as well as the proteomic studies by Van 
Damme et al. [5]. In addition to the P4P1 cleavage site
sequences, segments of different lengths and compositions
centered on the P1 position were selected. In all,
eight groups of sequences were obtained: P2P2', P4P1,
P4P2', P4P4', P6P6', P8P8', P10P10' and P14P10'. We further
extracted an equal number of "non-cleavage" sites by
randomly selecting non-annotated tetrapeptide
sequences (and other corresponding sequence segments)
on the substrates. On the P10P10' dataset, we computed
P_x (the relative position-specific residue propensity) of
each amino acid at the different residue positions along
the 20-mer sequence. P_x was computed as the ratio of
the frequency of occurrence of a particular residue in
the cleavage site sequences to the frequency of the same
residue in the non-cleavage site sequences at the particular position.
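
The propensity calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names and the toy four-residue sequences are invented for the example, and returning `None` where a residue never occurs in the non-cleavage pool is an assumption, since the text does not say how a zero denominator was handled.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_frequencies(seqs):
    """Return one {residue: relative frequency} dict per sequence position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    profiles = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs)
        profiles.append({aa: counts[aa] / len(seqs) for aa in AMINO_ACIDS})
    return profiles

def propensities(cleaved, uncleaved):
    """P_x[pos][aa] = frequency in cleavage pool / frequency in non-cleavage pool.

    Assumption: the ratio is reported as None when the residue is absent
    from the non-cleavage pool at that position.
    """
    table = []
    for pos_col, neg_col in zip(position_frequencies(cleaved),
                                position_frequencies(uncleaved)):
        table.append({aa: (pos_col[aa] / neg_col[aa] if neg_col[aa] > 0 else None)
                      for aa in AMINO_ACIDS})
    return table

def average_propensity(table, aa):
    """Average P_x of one amino acid over the positions where it is defined."""
    values = [col[aa] for col in table if col[aa] is not None]
    return sum(values) / len(values) if values else None

# Toy data (hypothetical, four residues wide instead of the 20-mer in the text).
cleaved = ["IEPD", "IEAD", "VEPD", "IDPD"]
uncleaved = ["IKPG", "LEAD", "IKPS", "VRGD"]
px = propensities(cleaved, uncleaved)
print(px[0]["I"], px[3]["D"])
```

A value above 1 marks a residue that is enriched at that position among cleavage sites, and a value below 1 marks a depleted one.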

Table 1 Average P_x of amino acids. The average P_x of each
amino acid was calculated by averaging the P_x values of
that amino acid across all residue positions
within the 20-mer sequence window (P10P10').

Amino acid    Average P_x
A             1.14
C             0.72
D             1.98
E             1.46
F             0.80
G             1.02
H             0.48
I             1.05
K             0.69
L             0.88
M             1.10
N             0.86
P             0.93
Q             0.96
R             0.66
S             1.07
T             0.96
V             1.08
W             0.46
Y             0.80

As shown in Table 1, measurements of average P_x in
the P10P10' sequences indicate an unusually high enrichment
for the negatively charged amino acids Asp and
Glu, with average P_x scores of 1.98 and 1.46 respectively.
Conversely, there are significantly lower propensities for
the positively charged amino acids (His, Lys and Arg all
possess average P_x of less than 0.70). In addition, the
large hydrophobic residue Trp is also weakly represented
among the cleavage site sequences, with an average
P_x of 0.46. To further quantify position-specific residue
propensities, we plotted a sequence logo using the
P10P10' sequences and constructed a heat map of P_x
scores from the same dataset (Figures 1
and 2 respectively). At the P1 position, Asp is, as expected, the
most conserved residue, with a notable presence of Glu,
Asn and Ser as alternatives. Interestingly, Pro and Cys
residues are more conserved in the cleavage sites than in
the non-cleavage sites at the P2 position, while P3
is dominated by the acidic residues Asp and Glu. The
P4 position showed significant propensities for the
branched-chain amino acids Leu, Ile and Val. Remarkably,
the most prominent feature distinguishing cleavage
site sequences from non-cleavage site sequences appears
to be the extended stretches of acidic residues (Asp and
Glu) upstream and downstream of the cleavage site.
Downstream of the cleavage site, it is further observed
that small amino acids such as Gly, Ser, Ala and Leu are
highly enriched at P1' and P2'. These results indicate that
cleavage sites of granzyme B substrates and the flanking
upstream and downstream sequences have unique
position-specific residue propensities. These composite
signatures could be incorporated into machine learning
algorithms for the development of accurate
computational prediction models.

SVM prediction of granzyme B cleavage sites 

Figure 1 Sequence logo of amino acids in the vicinity of the granzyme B cleavage site (P10 to P10').

Figure 2 Heat map of relative position-specific amino acid propensities (P_x). P_x values were computed
for the P10P10' dataset (position axis: P10 to P10'). P_x values were computed as the ratio of the
frequency of occurrence of the amino acid in the cleavage sites pool over the frequency of occurrence
of the same amino acid in the non-cleavage sites pool at a specific position. Increasing color
intensities (white to blue) indicate proportionately greater enrichment of the amino acid in the
cleavage sites over non-cleavage sites, and vice versa for decreasing color intensities.

To account for these unique signatures of residue conservation and position-specific propensities
for in silico prediction, we developed SVM prediction models incorporating the Bayes Feature
Extraction (BFE) approach as described in Shao et al. [11]. Vector representation using the BFE
approach was shown to significantly improve performance in several bio-computational problems -
such as the prediction of protein methylation sites [11], caspase cleavage [12] and linear B-cell
epitopes [13] - over simple binary encoding schemes. In BFE, feature vectors are encoded in a
bi-profile manner, comprising positive position-specific and negative position-specific profiles.
These profiles were generated by accounting for the frequency of occurrence of each amino acid at
each position of the sequences in the positives pool (cleavage site sequences) and negatives pool
(non-cleavage site sequences) respectively. Here, we trained a series of SVM classifiers on sequence
windows of diverse lengths and compositions (P2P2', P4P1, P4P2', P4P4', P6P6', P8P8', P10P10' and
P14P10') using simple binary encoding and BFE schemes (details in Materials and Methods). Datasets
were segmented into training and independent test sets comprising 480 positives/480 negatives and
100 positives/100 negatives respectively. Using the RBF kernel, 10-fold cross-validation was
implemented to acquire the optimal set of C and γ parameter values. SVM classifiers were
subsequently trained on the entire training set using the optimized parameters and evaluated on the
independent test sets.
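
The two encoding schemes can be illustrated with a short Python sketch. It is a hedged illustration rather than the authors' implementation: the function names and toy sequences are hypothetical, the bi-profile layout (positive-profile values followed by negative-profile values) is one natural reading of the description, and the one-hot ordering assumes the alanine and cysteine examples given in Materials and Methods continue in reverse alphabetical order.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def binary_encode(seq):
    """Simple binary encoding: 20 values per residue, a single 1 marking its identity."""
    vec = []
    for aa in seq:
        one_hot = [0] * 20
        # The paper gives alanine as [0,...,0,1] and cysteine as [0,...,1,0];
        # extending that reversed order to the other residues is an assumption.
        one_hot[19 - AMINO_ACIDS.index(aa)] = 1
        vec.extend(one_hot)
    return vec

def position_profile(seqs):
    """Per-position residue frequencies for a pool of equal-length sequences."""
    profile = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        profile.append({aa: counts[aa] / len(seqs) for aa in AMINO_ACIDS})
    return profile

def bfe_encode(seq, pos_profile, neg_profile):
    """Bi-profile encoding: for each residue, its frequency at that position in the
    positives pool, followed by its frequency in the negatives pool (2 x length)."""
    positive_part = [pos_profile[i][aa] for i, aa in enumerate(seq)]
    negative_part = [neg_profile[i][aa] for i, aa in enumerate(seq)]
    return positive_part + negative_part

# Toy pools (hypothetical); real profiles would come from the training set only.
positives = ["IEPD", "IEAD", "VEPD", "IDPD"]
negatives = ["IKPG", "LEAD", "IKPS", "VRGD"]
pos_prof = position_profile(positives)
neg_prof = position_profile(negatives)
print(len(binary_encode("IEPD")), bfe_encode("IEPD", pos_prof, neg_prof))
```

The sketch makes the size difference concrete: a window of n residues becomes 20 x n values under binary encoding but only 2 x n values under the bi-profile scheme.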

As given in Table 2, the P4P1 classifier utilizing simple
binary encoding (P4P1-SVM) registered an accuracy of
77.50% and A_ROC of 0.85 on independent testing. The
other classifiers showed consistent improvement in
accuracy and A_ROC as the sequence window was extended
beyond P4P1 to include the flanking upstream and
downstream residues, achieving the best scores of
83.50% and 0.89 respectively with the P8P8'-SVM classifier.
The P4P1 classifier utilizing the BFE scheme (P4P1-Bayes)
attained an accuracy of 76.50% and A_ROC of 0.84
(Table 3). In a similar fashion, prediction performance
improved steadily as the sequence window was extended
beyond P4P1, achieving the best accuracy of 86.50% with
the P8P8'-Bayes classifier and the best A_ROC of 0.94 with
the P10P10'-Bayes and P14P10'-Bayes classifiers. Interestingly,
in both feature representation schemes, prediction
performance did not significantly improve with
sequences longer than P8P8'. This could be due to the
fact that much of the information specific for differentiating
cleavage sites from non-cleavage sites is
encoded within the sequences situated closer to the
cleavage sites, as evidenced by the unique residue
propensities discussed earlier. In addition, accuracy and
A_ROC scores across most sequence lengths and compositions
were generally higher for classifiers trained using
the BFE scheme, with the greatest improvements
observed when longer sequences (P6P6', P8P8', P10P10'
and P14P10') were employed.

Table 2 Results of SVM prediction using simple binary encoding

SVM classifier    Sensitivity (%)    Specificity (%)    Accuracy (%)    A_ROC
P2P2'-SVM         73.00              68.00              70.50           0.77
P4P1-SVM          77.00              78.00              77.50           0.85
P4P2'-SVM         85.00              76.00              80.50           0.89
P4P4'-SVM         84.00              80.00              82.00           0.89
P6P6'-SVM         84.00              82.00              83.00           0.89
P8P8'-SVM         83.00              84.00              83.50           0.89
P10P10'-SVM       81.00              82.00              81.50           0.89
P14P10'-SVM       78.00              81.00              79.50           0.88

Table 3 Results of SVM prediction using Bayes Feature Extraction

SVM classifier    Sensitivity (%)    Specificity (%)    Accuracy (%)    A_ROC
P2P2'-Bayes       71.00              71.00              71.00           0.78
P4P1-Bayes        79.00              74.00              76.50           0.84
P4P2'-Bayes       82.00              80.00              81.00           0.89
P4P4'-Bayes       82.00              81.00              81.50           0.91
P6P6'-Bayes       86.00              84.00              85.00           0.91
P8P8'-Bayes       89.00              84.00              86.50           0.93
P10P10'-Bayes     87.00              85.00              86.00           0.94
P14P10'-Bayes     88.00              82.00              85.00           0.94

Next, we compared our prediction method with GraBCas [9] and the SVM models developed by Barkan
et al. [10]. As the GraBCas algorithm primarily focuses on the detection of specific tetrapeptide
motifs, we applied the algorithm on our P4P1 independent test set, which contains only the
tetrapeptide cleavage site sequences. Using the recommended cut-off score of 0.12, GraBCas
predicted only 61 out of 100 cleavage sites correctly (S_n = 61%). On the same dataset, our
P4P1-SVM and P4P1-Bayes classifiers respectively predicted 77 out of 100 (S_n = 77%) and 79 out
of 100 (S_n = 79%) cleavage sites correctly. The weaker sensitivity observed for GraBCas could be
due to its use of position-specific scoring matrices (PSSMs) derived from a small, outdated set
of in vitro cleavage specificities, and to its absolute requirement of an Asp residue at P1 of
the cleavage sites. To further evaluate the performance of the PSSM-based algorithm in our
context, we constructed PSSMs derived from our entire dataset of cleavage sites, and found that
the A_ROC scores of the PSSM-based predictors were generally poorer than those of our SVM-based
classifiers (data not shown). In Barkan et al., the best SVM classifier recorded a true positive
rate (TPR) of 0.79 and false positive rate (FPR) of 0.21 at the critical point on the receiver
operating characteristic (ROC) curve when tested on an independent test set. In our SVM method,
several classifiers encoded using the BFE scheme registered better prediction performance when
measured by the same metrics: P10P10'-Bayes with TPR of 0.86 and FPR of 0.14, as well as
P14P10'-Bayes, P8P8'-Bayes and P6P6'-Bayes with TPRs of 0.85 and FPRs of 0.15.

Prediction of granzyme B cleavage of CHIKV proteome 

To investigate the applicability of our computational 
method, we applied the SVM classifiers on the proteome 
of the Chikungunya virus (CHIKV) and analyzed for the 
presence of hitherto undiscovered granzyme B cleavage 
sites. CHIKV is a member of the alphavirus family and 
has been known to be transmitted to humans via the 
bite of the virus-borne Aedes mosquito [14]. Acute 
infection of CHIKV results in symptoms such as abrupt 
fever, skin rash and arthralgia. As CHIKV epidemics 
have been re-emerging in recent times, there have been 
concerted efforts directed toward developing relevant 
vaccines and drug therapies. During viral infections, 
granzyme B has been reported to mediate downstream 
cleavage of critical host regulatory proteins, leading to 
the induction of apoptotic cell death, and hence
disruption of viral propagation [15]. Although granzyme
B-induced apoptotic cell death has long been considered
the de facto mechanism for killing virus-infected cells,
emerging evidence suggests that the enzyme could exert
direct antiviral activity through cleavage of the viral
proteins [15]. For these reasons, it is natural to ask whether
the CHIKV proteome may be directly regulated by granzyme B
activity in this manner and whether cleavage of specific
CHIKV proteins will potentiate the host innate immune
responses against viral infectivity.

Four non-structural and four structural proteins of the
CHIKV proteome (strain: LR2006_OPY1) were scanned
for granzyme B cleavage sites using the P8P8'-Bayes classifier.
Since the majority of experimentally verified cleavage
sites are cleaved after an Asp
residue, we restricted our prediction scans to
candidate sites containing an Asp residue at P1. As shown in
Table 4, we found potential granzyme B cleavage sites
in all CHIKV proteins except the structural proteins E1,
E3 and 6K. A significantly larger proportion of these
sites were found in the non-structural proteins NSP1, 
NSP2, NSP3 and NSP4, as compared to the structural 
proteins E2 and capsid. As the alphaviral non-structural 
proteins are known to be involved in viral survival and 
replication, we would expect the cleavage of these pro- 
teins by granzyme B to abrogate viral survival mechan- 
isms at different points of the viral reproduction cycle 
[16]. Indeed, the cleavage of NSP1 protein at Asp-11 
and Asp-58, which are both localized within the methyl- 
transferase domain, could lead to inhibition of the 
mRNA capping during RNA synthesis. Conversely, the 
cleavage of NSP2 helicase domain at Asp-247 and Asp- 
343, as well as the RNA polymerase domain at Asp-291, 
Asp-371, Asp-476 and Asp-540 on the NSP4 protein 
could hinder viral RNA synthesis and translation. In 
addition, cleavage of the capsid protein at Asp-112 and
Asp-114 within the protease domain might lead to prevention
of auto-cleavage of the immature capsid protein
from the viral structural polyprotein. 



Conclusions 

In this paper, we constructed a comprehensive database 
of experimentally verified granzyme B cleavage sites for 
analysis and development of prediction methods. We 
discovered that flanking sequences of cleavage sites pos- 
sess distinctive residue composition and position-specific 
propensity patterns which could be helpful in discrimi- 
nating the cleavage sites from non-cleavage sites in 
silico. We have rigorously tested SVM classifiers 
employing simple binary encoding and the Bayes Fea- 
ture Extraction schemes to predict granzyme B cleavage 
sites. Results also show that the best classifiers are more 
effective than existing algorithms. We applied our pre- 
diction method on the Chikungunya viral proteome and 
identified several regulatory domains of viral proteins to 
be potential targets of granzyme B cleavage, suggesting a 
direct antiviral function of granzyme B during host-viral 
innate immune responses. To complement experimental 
research, we have implemented our prediction method 
on a web server which is freely accessible at
http://www.casbase.org/grasvm/index.html. In the immediate future,
we will be exploring the influence of cleavage site sec- 
ondary structures, solvent accessibilities and other physi- 
cochemical properties on protease-substrate cleavage 
specificities, as well as their potential for enhancing the 
performance of our SVM prediction models. Computa- 
tional prediction of granzyme B substrates will comple- 
ment on-going experimental efforts and refine our 
understanding of the biochemistry of this fascinating 
protease and its relatives. 

Table 4 Prediction of granzyme B cleavage of CHIKV proteome

Protein    Biological activity and function                                     Cleavage sites*
NSP1       Non-structural: mRNA capping                                         9, a 58, 525
NSP2       Non-structural: NTPase, helicase and protease activities             116, 247, 343
NSP3       Non-structural: ADP-ribose phosphatase activity                      181, 350, 363, 506
NSP4       Non-structural: RNA polymerase activity                              219, 371, 476, 540
E1         Structural: virus-host cell fusion                                   Nil
E2         Structural: virus-host cell attachment                               77
E3         Structural: unknown                                                  Nil
Capsid     Structural: protease, viral nucleocapsid formation                   112, 174
6K         Structural: membrane permeabilization, budding of viral particles    Nil

*Position of the P1 residue on the substrate. All predicted cleavage sites contain Asp at P1. Underlines indicate P1 location in the functional domain(s) of protein.

Materials and methods

Datasets

We extracted a pool of 779 unique, experimentally verified cleavage sites from the literature. Of
these, 723 sequences were derived from proteomic experimental studies conducted by Van Damme et
al. [5], with the remaining 56 from systematic in vitro and in vivo experiments as compiled in
Barkan et al. [10]. We further extracted sequence segments of different lengths flanking the P1
cleavage sites. In all, eight datasets were constructed: the tetrapeptide cleavage site sequences
(referred to as the P4P1 dataset) and sequences containing residues extended up to P14 and P10'
(the P2P2', P4P2', P4P4', P6P6', P8P8', P10P10' and P14P10' datasets). These sequences were
assigned as positive examples for analysis as well as for development of the SVM method. An equal
number of "non-cleavage sites" or negative examples were obtained by randomly extracting P1
residues on the substrates. Sequence segments of the aforementioned lengths and compositions were
obtained as detailed earlier. All datasets of positive and negative sequences (779/779) were
subsequently subjected to homology filtering using the CD-HIT clustering algorithm [17], where
sequences bearing more than 85% sequence identity with any other sequence in the dataset were
eliminated. The final datasets comprised 580 positive and 580 negative sequences (the complete
list of cleavage sites is available in Additional File 1). For analysis, all 580 positives and
580 negatives from the P10P10' dataset were used. For SVM model development, datasets were
partitioned into training and test sets consisting of 480 positives/480 negatives and
100 positives/100 negatives respectively.
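
As a rough illustration of how the sequence windows and negative examples described above might be cut from a substrate, consider the following Python sketch. The helper names, the toy protein and the decision to skip windows that run past a terminus are all assumptions made for the example; the paper does not describe how truncated windows were treated, and the homology filtering step with CD-HIT is not reproduced here.

```python
import random

def extract_window(protein, p1_position, upstream, downstream):
    """Return residues P<upstream>..P1 followed by P1'..P<downstream>'.

    p1_position is the 1-based index of the P1 residue. Returns None when the
    window would extend past either terminus (an assumption of this sketch).
    """
    start = p1_position - upstream      # 0-based index of the P<upstream> residue
    end = p1_position + downstream      # exclusive end, one past P<downstream>'
    if start < 0 or end > len(protein):
        return None
    return protein[start:end]

def sample_negative_p1(protein, annotated_p1, upstream, downstream, k, seed=0):
    """Randomly pick k non-annotated P1 positions whose full window fits."""
    rng = random.Random(seed)
    candidates = [p for p in range(1, len(protein) + 1)
                  if p not in annotated_p1
                  and extract_window(protein, p, upstream, downstream) is not None]
    return sorted(rng.sample(candidates, min(k, len(candidates))))

# Toy substrate (hypothetical) with a cleavage site after Asp-7.
protein = "MKLIEPDSGAVTRQ"
print(extract_window(protein, 7, 4, 0))   # P4P1 window
print(extract_window(protein, 7, 4, 2))   # P4P2' window
print(sample_negative_p1(protein, {7}, 4, 2, k=3))
```

In this notation the P4P1 dataset corresponds to `upstream=4, downstream=0`, and the P14P10' dataset to `upstream=14, downstream=10`.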



Sequence analysis

The relative position-specific residue propensity P_x was computed as the ratio of the frequency
of occurrence of a particular amino acid in the cleavage sites pool to its frequency of occurrence
in the non-cleavage sites pool at a specific position on the sequence. Using the P10P10' dataset,
P_x scores were calculated for every amino acid at each of the twenty residue positions and
visualized on heat maps. Additionally, we constructed a sequence logo representation of the
positive sequences from the P10P10' dataset using WebLogo [18].

SVM vector representation

To encapsulate sequence information for SVM training and testing, input vectors were constructed
using simple binary or bi-profile Bayes Features encoding. For simple binary encoding, each amino
acid is represented by a vector of 20 dimensions, comprising binary values of zeroes and ones. For
example, alanine was represented as [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1] and cysteine as
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]. Hence, in this case, a 20-mer sequence will be
represented by a vector of 400 dimensions (20 x 20). A detailed description of bi-profile vector
encoding using Bayes Features is available in Shao et al. [11]. In short, feature vectors contain
information from both positive position-specific and negative position-specific profiles. These
profiles were generated by accounting for the frequency of occurrence of each amino acid at each
position of the sequences in the positives pool (cleavage site sequences) and negatives pool
(non-cleavage site sequences) respectively. Therefore, a 20-mer sequence (from the P10P10'
dataset) would be represented by a feature vector of 40 dimensions (20 x 2), containing
information of the residues in both positive (cleavage site sequences) and negative (non-cleavage
site sequences) spaces. For all sequence representations, P1 residues were excluded from the
feature vectors.

SVM model development

To train and test the SVM models, we used the LIBSVM package provided by Chang and Lin [19]. For
details on the SVM method, readers are advised to consult the article by Burges [20]. In short,
SVM is grounded on the structural risk minimization concept from statistical learning theory. A
set of training examples (positives and negatives) can be encoded by the feature vectors x_i
(i = 1, 2, ..., N) with resultant classes y_i ∈ {+1, -1}. The SVM algorithm trains a classifier by
mapping the input feature vectors, using a kernel function in the majority of cases, onto a
high-dimensional space, and then selects a discriminating hyperplane that separates the two
classes with maximal margin and the least error. The decision function for classification of
unseen examples is defined as:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)$$

where K(x_i, x_j) is the kernel function, and the parameters are resolved by maximizing the
following:

$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

with the following constraints:

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C$$
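
To make the decision function concrete, the following Python sketch evaluates it for a hand-made toy model. The support vectors, coefficients and bias are invented for illustration and are not taken from any trained classifier in this study; the RBF kernel is written in the form LIBSVM uses, K(u, v) = exp(-γ ||u - v||^2).

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Sum over support vectors of alpha_i * y_i * K(x_i, x), plus the bias term."""
    total = sum(alpha * y * rbf_kernel(sv, x, gamma)
                for sv, y, alpha in zip(support_vectors, labels, alphas))
    return total + bias

def classify(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of the decision value: +1 (cleavage site) or -1 (non-cleavage site)."""
    value = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if value >= 0 else -1

# Toy model (hypothetical): one negative and one positive support vector.
# The coefficients satisfy the dual constraint sum(alpha_i * y_i) = 0.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]
print(classify((0.9, 0.8), support_vectors, labels, alphas, bias=0.0, gamma=1.0))
```

Only examples with non-zero coefficients contribute to the sum, which is why a trained model needs to store just its support vectors.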



C is the regularization parameter that controls the trade-off between margin and classification
error. We used the radial basis function (RBF) kernel and performed grid-based optimization of γ,
which controls the capacity of the RBF kernel, and C using 10-fold cross-validation. In 10-fold
cross-validation, the training set was randomly partitioned into ten subsets, where one of the
subsets was used as the test set while the other subsets were used for training the classifier.
The trained classifier was evaluated using the test set. This procedure was repeated ten times
using different subsets for testing, hence making sure that all subsets were utilized for both
training and testing. The optimized γ and C values were then used to train a classifier on the
entire training set for independent testing on an out-of-sample test set. Graphical plots of
optimization results are provided in Additional File 2.
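
The cross-validated grid search described above can be outlined as follows. This is a schematic sketch, not the procedure actually run with LIBSVM: `make_count_correct` is a hypothetical callback standing in for "train an RBF SVM with this C and γ on the training folds and count correct predictions on the held-out fold", and the grid values in the usage example are arbitrary, since the paper does not list the grid that was searched.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(n_samples, count_correct, k=10, seed=0):
    """Each fold is held out once; the rest is passed on as training indices.

    count_correct(train_idx, test_idx) must return the number of held-out
    examples classified correctly (hypothetical callback).
    """
    correct = 0
    for held_out in k_fold_indices(n_samples, k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(n_samples) if i not in held_out_set]
        correct += count_correct(train_idx, held_out)
    return correct / n_samples

def grid_search(n_samples, make_count_correct, c_grid, gamma_grid, k=10):
    """Return (best accuracy, C, gamma) over all pairs in the grid."""
    best = None
    for c in c_grid:
        for gamma in gamma_grid:
            acc = cross_validated_accuracy(n_samples, make_count_correct(c, gamma), k)
            if best is None or acc > best[0]:
                best = (acc, c, gamma)
    return best

# Usage with a dummy scorer that pretends (C=2.0, gamma=0.5) is perfect.
def dummy_scorer(c, gamma):
    return lambda train_idx, test_idx: len(test_idx) if (c, gamma) == (2.0, 0.5) else 0

print(grid_search(960, dummy_scorer, [0.5, 2.0, 8.0], [0.125, 0.5]))
```

With the 960 training examples used here (480 positives and 480 negatives), each of the ten folds holds 96 examples.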

Evaluation of model performance 

A set of statistical variables were established to evaluate 
the performance of the SVM classifier for the prediction 
of granzyme B cleavage sites: 

(i) True Positives (TP), for the number of correctly 
classified cleavage sites. 

(ii) False Positives (FP), for the number of incorrectly
classified non-cleavage sites. 

(iii) True Negatives (TN), for the number of correctly 
classified non-cleavage sites. 

(iv) False Negatives (FN), for the number of incor- 
rectly classified cleavage sites. 

Sensitivity (S_n) and Specificity (S_p), which measure
the capability of the model to correctly classify the cleavage
sites and non-cleavage sites respectively, were computed
as well:

$$S_n = \frac{TP}{TP + FN}$$

$$S_p = \frac{TN}{TN + FP}$$

To measure the overall model performance, we computed
Accuracy (A_cc):

$$A_{cc} = \frac{TP + TN}{TP + FN + TN + FP}$$

In addition, we plotted the receiver operating characteristic
(ROC) curve and computed the area under the
curve (A_ROC) for threshold-independent evaluation. To
compare against the prediction model developed by Barkan
et al., we further determined the critical points on
the ROCs of our SVM classifiers, which are defined as
the points where the ROC curves intersect the lines
connecting coordinates (1, 0) and (0, 1) on the graphs.
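
The evaluation measures above can be reproduced with a small Python sketch. The function names and the toy scores are illustrative assumptions; in particular, taking the ROC point closest to the line TPR = 1 - FPR is only a discrete approximation of the critical point, and the trapezoidal rule is one common way to estimate A_ROC rather than necessarily the one used in this study.

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from the four counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

def roc_points(true_labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(1 for t in true_labels if t == 1)
    n_neg = len(true_labels) - n_pos
    ranked = sorted(zip(scores, true_labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal estimate of the area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def critical_point(points):
    """ROC point nearest the line joining (1, 0) and (0, 1), i.e. TPR = 1 - FPR."""
    return min(points, key=lambda p: abs(p[1] - (1 - p[0])))

# Counts implied by Table 3 for P8P8'-Bayes on 100 positives and 100 negatives.
print(rates(tp=89, fp=16, tn=84, fn=11))

# Toy scores (hypothetical) for the ROC-based measures.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(area_under_curve(curve), critical_point(curve))
```

At the critical point the true positive rate equals one minus the false positive rate, so sensitivity and specificity coincide, which is what allows a single TPR/FPR pair to be compared across methods.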

Additional material 